32 research outputs found

    Exploiting Monotone Convergence Functions in Parallel Programs

    Scientific codes that use iterative methods are often difficult to parallelize well. Such codes usually contain while loops that iterate until they converge upon the solution. Problems arise because the number of iterations cannot be determined at compile time, and tests for termination usually require a global reduction and an associated barrier. We present a method that allows us to avoid performing global barriers and to exploit pipelined parallelism when processors can detect non-convergence from local information. (Also cross-referenced as UMIACS-TR-96-31.1)
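    To make the termination problem concrete, here is a minimal sequential sketch in Python of the kind of loop the abstract describes: a relaxation whose loop test is a maximum over all elements, which is the global reduction that would need a barrier in a parallel version. The function name `relax_until_converged`, the 1-D averaging stencil, and the tolerance are illustrative assumptions. The sketch does not implement the paper's barrier-avoiding method.

```python
def relax_until_converged(values, tol=1e-6, max_iters=10_000):
    """Average each interior point with its neighbours until no point moves more than tol.

    Hypothetical example: the stencil and tolerance are assumptions, not taken from the paper.
    """
    current = list(values)
    iterations = 0
    while iterations < max_iters:
        nxt = current[:]
        for i in range(1, len(current) - 1):
            nxt[i] = 0.5 * (current[i - 1] + current[i + 1])
        # Global reduction: the largest change over *all* elements decides termination.
        # In a parallel version this is the step that needs a reduction and a barrier.
        change = max((abs(a - b) for a, b in zip(nxt, current)), default=0.0)
        current = nxt
        iterations += 1
        if change <= tol:
            break
    return current, iterations
```

    Because the test is a maximum, any single element whose change exceeds `tol` is already enough to show that the loop has not converged, which is the sort of local evidence of non-convergence the abstract refers to.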

    Parallelizing Julia with a Non-Invasive DSL (Artifact)

    This artifact is based on ParallelAccelerator, an embedded domain-specific language (DSL) and compiler for speeding up compute-intensive Julia programs. In particular, Julia code that makes heavy use of aggregate array operations is a good candidate for speeding up with ParallelAccelerator. ParallelAccelerator is a non-invasive DSL that makes as few changes to the host programming model as possible.

    Parallelizing Julia with a Non-Invasive DSL

    Computational scientists often prototype software using productivity languages that offer high-level programming abstractions. When higher performance is needed, they are obliged to rewrite their code in a lower-level efficiency language. Different solutions have been proposed to address this trade-off between productivity and efficiency. One promising approach is to create embedded domain-specific languages that sacrifice generality for productivity and performance, but practical experience with DSLs points to some roadblocks preventing widespread adoption. This paper proposes a non-invasive domain-specific language that makes as few visible changes to the host programming model as possible. We present ParallelAccelerator, a library and compiler for high-level, high-performance scientific computing in Julia. ParallelAccelerator's programming model is aligned with existing Julia programming idioms. Our compiler exposes the implicit parallelism in high-level array-style programs and compiles them to fast, parallel native code. Programs can also run in "library-only" mode, letting users benefit from the full Julia environment and libraries. Our results show encouraging performance improvements with very few changes to source code required. In particular, few to no additional type annotations are necessary.

    Transitive Closure of Infinite Graphs and its Applications

    Integer tuple relations can concisely summarize many types of information gathered from analysis of scientific codes. For example, they can be used to precisely describe which iterations of a statement are data dependent on which other iterations. It is generally not possible to represent these tuple relations by enumerating the related pairs of tuples. For example, it is impossible to enumerate the related pairs of tuples in the relation {[i] -> [i+2] | 1 <= i <= n-2}. Even when it is possible to enumerate the related pairs of tuples, such as for the relation {[i,j] -> [i',j'] | 1 <= i,j,i',j' <= 100}, it is often not practical to do so. We instead use a closed form description by specifying a predicate consisting of affine constraints on the related pairs of tuples. As we just saw, these affine constraints can be parameterized, so what we are really describing are infinite families of relations (or graphs). Many of our applications of tuple relations rely heavily on an operation called transitive closure. Computing the transitive closure of these "infinite graphs" is very different from the traditional problem of computing the transitive closure of a graph whose edges can be enumerated. For example, the transitive closure of the first relation above is the relation {[i] -> [i'] | exists beta s.t. i'-i = 2beta and 1 <= i <= i' <= n}. As we will prove, transitive closure is not computable in the general case. We have developed algorithms that produce exact results in most commonly occurring cases and produce upper or lower bounds (as necessary) in the other cases. This paper will describe our algorithms for computing transitive closure and some of its applications, such as determining which inter-processor synchronizations are redundant. (Also cross-referenced as UMIACS-TR-95-48)
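    The example relation can be checked by brute force once n is fixed to a concrete value. The Python sketch below enumerates {[i] -> [i+2] | 1 <= i <= n-2}, computes its transitive closure by repeated composition, and compares the result with the quoted closed form. All function names are hypothetical, and this enumeration is exactly what the paper's symbolic algorithms avoid, so it only serves as a sanity check. Note that the closed form as quoted, with i <= i', also admits beta = 0 (the identity pairs), so the sketch handles the closure with and without those pairs.

```python
def relation(n):
    """Pairs of {[i] -> [i+2] | 1 <= i <= n-2} for one concrete n."""
    return {(i, i + 2) for i in range(1, n - 1)}


def positive_transitive_closure(pairs):
    """Closure by repeated composition with the base relation until nothing new appears."""
    closure = set(pairs)
    while True:
        new_pairs = {(a, d) for (a, b) in closure for (c, d) in pairs if b == c}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs


def closed_form(n, include_identity):
    """Pairs with i' - i = 2*beta and 1 <= i <= i' <= n; beta = 0 only if include_identity."""
    result = set()
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            if (j - i) % 2 == 0 and (include_identity or j > i):
                result.add((i, j))
    return result


def check(n):
    """True if the enumerated closure agrees with the closed form for this n."""
    positive = positive_transitive_closure(relation(n))
    identity = {(i, i) for i in range(1, n + 1)}
    return (positive == closed_form(n, include_identity=False)
            and positive | identity == closed_form(n, include_identity=True))
```

    Calling `check(n)` for a handful of small n exercises both variants. A symbolic algorithm has to establish the same fact for every n at once.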

    Compiler Support for Sparse Tensor Computations in MLIR

    Sparse tensors arise in problems in science, engineering, machine learning, and data analytics. Programs that operate on such tensors can exploit sparsity to reduce storage requirements and computational time. Developing and maintaining sparse software by hand, however, is a complex and error-prone task. Therefore, we propose treating sparsity as a property of tensors, not a tedious implementation task, and letting a sparse compiler generate sparse code automatically from a sparsity-agnostic definition of the computation. This paper discusses integrating this idea into MLIR.
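    The gap between a sparsity-agnostic definition and hand-written sparse code can be illustrated with a matrix-vector product. In the Python sketch below, `dense_matvec` states the computation without any reference to sparsity, while `csr_matvec` is the kind of format-specific loop nest a programmer would otherwise write by hand. The compressed sparse row (CSR) layout and all function names are assumptions chosen for illustration. Nothing here is MLIR output or MLIR API.

```python
def dense_matvec(a, x):
    """Sparsity-agnostic definition: y[i] = sum over j of a[i][j] * x[j]."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def to_csr(a):
    """Store only the nonzeros of a, row by row (compressed sparse row layout)."""
    values, col_indices, row_offsets = [], [], [0]
    for row in a:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_indices.append(j)
        row_offsets.append(len(values))
    return values, col_indices, row_offsets


def csr_matvec(values, col_indices, row_offsets, x):
    """Same computation as dense_matvec, but visiting stored nonzeros only."""
    y = []
    for i in range(len(row_offsets) - 1):
        acc = 0
        for k in range(row_offsets[i], row_offsets[i + 1]):
            acc += values[k] * x[col_indices[k]]
        y.append(acc)
    return y
```

    Both functions compute the same result, but the second one is tied to a single storage format. Writing and maintaining such a variant for every format and every kernel is the manual effort the abstract proposes to hand over to a compiler.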

    Generating Efficient Stack Code for Java

    Optimizing Java byte code is complicated by the fact that it uses a stack-based execution model. Changing the intermediate representation from a stack-based to a register-based one brings the problem of Java byte code optimization into the well-studied domain of compiler optimizations for register-based codes. In this paper we describe a technique for converting register-based code into Java byte code. The code generation techniques developed for stack-based computers are not directly applicable to this problem, as the comparative cost of the local memory and stack manipulation instructions in the JVM is quite different from that in stack-based computers. Naive verbose translation of register-based code into Java byte code produces code with many redundant store and load instructions. The tool that we have developed removes 90-100% of the stores to local (i.e., non-global) variables. It produces Java byte code that is slightly faster and shorter than..

    On Fast Array Data Dependence Tests

    Array data-dependence analysis is an important part of any optimizing compiler for scientific programs. The Omega test is an exact test for integer solutions to affine constraints and can be used for array data dependence. There are other tests that are less exact but are intended to be faster. Many of these less exact tests are rather complicated and designed to be as accurate as possible while still being fast. In this paper, we describe the Epsilon test, intended to be as simple and as fast as possible, while not being embarrassingly inaccurate. We explore the relative speed and accuracy of the Epsilon and Omega tests, and discuss how they might be joined. We also point out serious errors in recently published work on array data dependence tests.

    1 Introduction

    Array data dependence analysis is an important part of any optimizing compiler for scientific programs. Consider the following code fragment:

        for i = 1 to n do
        1:  a[i] := ...
        2:  ... := a[i-1]

    At each iteration the read reaches..
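    For the code fragment quoted in this excerpt, the dependences can be listed by brute force once n is fixed. The Python sketch below pairs every write of statement 1 with every read of statement 2 and keeps the pairs that touch the same array element with the write executing first. The function name `flow_dependences` is hypothetical. This enumeration is neither the Omega test nor the Epsilon test, both of which reason about the subscripts symbolically without fixing n.

```python
def flow_dependences(n):
    """Return (write_iteration, read_iteration) pairs for the quoted fragment.

    Statement 1 at iteration w writes a[w]; statement 2 at iteration r reads a[r-1].
    """
    deps = []
    for w in range(1, n + 1):
        for r in range(1, n + 1):
            same_location = (w == r - 1)
            # Statement 1 precedes statement 2 in the loop body, so a write in the
            # same iteration (w == r) would also execute before the read.
            write_first = w <= r
            if same_location and write_first:
                deps.append((w, r))
    return deps
```

    For any fixed n the result is the set of pairs (i, i+1) with 1 <= i <= n-1: each read of a[i-1] at iteration i >= 2 sees the value written one iteration earlier, and the read at iteration 1 touches a[0], which the loop never writes.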